Here we present the results of the integrated omics analysis using COSMOS. COSMOS integrates transcriptomics and metabolomics data using prior knowledge. This prior knowledge is a network, where nodes represent metabolites and proteins (e.g. kinases and metabolic enzymes) and signed, directed edges represent interactions among these entities. COSMOS builds on an optimization framework, that finds a subset of the original prior knowledge network that is in agreement with the data.
We estimated transcription factor activity from the transcriptomics data using Dorothea regulon and Viper algorithm. The TF activity served as input nodes for COSMOS. Then we defined the measured metabolites that changed significantly as outputs.
To explore multiple possible solutions for each samples, we (1) set in each COSMOS run to report multiple solutions within the given time limit (15 hours) and (2) we rerun the optimization 10 times with shuffled inputs.
Here we summarise the obtained results.
For each sample we run the COSMOS optimization 10 times, for 15 hours each, however in some cases the optimizaiton reached a memory limit of our cluster (128 GB), which resulted in loss of some results.
The figure below shows that for each sample, three to eight runs finished.
In each run we instructed the optimizer to enumerate multiple solutions, however, in most cases it found only a few (6-9) solutions. There are two samples, where many equivalent solutins were found.
The objective function measures the quality of each optimization. The objective function has two parts
therefore a smaller value is better.
## Warning: Removed 1 rows containing missing values (geom_point).
This figure shows that the repeated runs achieved similar objective functions. This give us confidence that they found a global optimal solution.
The samples colon_1000uM_24h and colon_100uM_24h have high objective function which means that the number of measurements are not well explained in those cases. We check the reason below.
Since the objective functions are similar in each sample, we can aggregate the multiple runs.
How many inputs and outputs are explained by the inferred networks?
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 7 x 5
## sample n_tfs modelled_tfs n_metabolites modelled_metabolites
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 colon_1000uM_24h 30 12.1 416 143.
## 2 colon_1000uM_48h 43 8.33 74 29.7
## 3 colon_1000uM_72h 43 7.67 92 35.7
## 4 colon_100uM_24h 17 9.86 409 137.
## 5 si_1000uM_24h 24 4 22 6.6
## 6 si_1000uM_48h 39 9.57 24 8.57
## 7 si_1000uM_72h 38 9.75 90 36.2
The above table shows how many of the transcription factors (n_tfs) and metabolites (n_metabolites) are explained in average by the COSMOS results (modelled_tfs and modelled_metabolites).
There are a few reasons why not all of them are included:
The above numbers explain why samples colon_1000uM_24 and colon_100uM_24 had high objective function value. These samples had 400+ metabolites as inputs but only 30% of them are included in the solutions.
Since the objective function values of the optimized models are similar, we can merge them. In the visualization we weight each edge by the time it appears across the solutions. This weight shows “essentiality” of the edges, i.e. the edges that disappears in some solutions are not essential to explain the data.
Note, if an edge appears in all solutions, it does not necessarily mean that there is no alternative explanation. It can happen that the optimizer didn’t find this alternative solution or the alternative solution includes more edges – our approch seeks for the smallest possible model.
The following figures show the consensus networks of each sample. Please note that you can zoom in to the network and rearrange the nodes if needed.
Legend:
Node type is encoded by the shape:
Colors:
Black border represents input and output nodes: estimated trascription factors and measured metabolites.
networks[[1]]
networks[[2]]
networks[[3]]
networks[[4]]
networks[[5]]
networks[[6]]
networks[[7]]
We clustered the samples based on the edges in their networks: rows represent the edges, samples are in the columns. The color shows the essentiality of the edge.
Early time SI samples are the most similar and late time colon samples also cluster together.
We filter for edges that appear across multiple samples to look for common mechanism. The following heatmap shows the edges for the samples that appears in at least 4 samples.
There are some interactions that appear exclusively for colon samples and some that appear across all the samples.
We clustered the samples based on the nodes in their networks: rows represent the nodes, samples are in the columns.
Early time SI samples are the most similar and late time colon samples also cluster together.
We filter for nodes that appear across multiple samples to look for common mechanism. The following heatmap shows the nodes for the samples that appears in at least 3 samples.
Which genes-sets are represented in results more than random? We apply over-representation analysis to answer this question. We use MSigDB gene sets: Hallmark genes and Canonical pathways for the analysis.
Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression.
Canonical pathways: Gene sets from pathway databases. Usually, these gene sets are canonical representations of a biological process compiled by domain experts.
ora_hallmarks %>%
group_by(sample) %>% slice_min(n=25,order_by = `Adjusted p-value`) %>%
ggplot() + geom_col(aes(reorder(pathway,-`Adjusted p-value`),-log10(`Adjusted p-value`))) +
geom_hline(yintercept = -log10(0.05),color="red",linetype = 2) +
facet_wrap(~sample,scales = "free_y") + coord_flip() +theme_bw() + ggtitle("ORA using Hallmarks")
ora_canonical %>%
group_by(sample) %>% slice_min(n=25,order_by = `Adjusted p-value`) %>%
ggplot() + geom_col(aes(reorder(pathway,-`Adjusted p-value`),-log10(`Adjusted p-value`))) +
geom_hline(yintercept = -log10(0.05),color="red",linetype = 2) +
facet_wrap(~sample,scales = "free_y") + coord_flip() +theme_bw() +
ggtitle("ORA using Canonical Pathways")
HALLMARK_TNFA_SIGNALING_VIA_NFKB_genes <- msig_to_df(msigdb$MSIGDB_HMARKS) %>%
filter(geneset=="HALLMARK_TNFA_SIGNALING_VIA_NFKB")
# is there some genes that appears across all the samples?
consensus_cosmos_networks %>% select(sample,consensus_nodes) %>%
unnest(consensus_nodes) %>%
filter(clean_names %in% HALLMARK_TNFA_SIGNALING_VIA_NFKB_genes$gene) %>%
group_by(sample) %>%
summarise(genes_in_set = paste(clean_names,collapse = ", "),.groups="drop")
## # A tibble: 7 x 2
## sample genes_in_set
## <chr> <chr>
## 1 colon_1000uM_… CEBPB, KYNU, SLC2A3, NAMPT, PFKFB3, GCH1, HES1, IL1A, JUN, NFA…
## 2 colon_1000uM_… CEBPB, MYC, RELA
## 3 colon_1000uM_… CEBPB, FOS, JUN, NFAT5, NFKB1, RELA
## 4 colon_100uM_2… B4GALT1, CEBPB, KYNU, SLC2A3, GCH1, IL1A, MYC, NFAT5, NFE2L2, …
## 5 si_1000uM_24h RELA
## 6 si_1000uM_48h CEBPB, NFKB1, RELA
## 7 si_1000uM_72h BIRC3, CEBPB, FOS, FOSL1, MYC, NFAT5, NFKB1, RELA, SMAD3